Machine Translation of Low-Resource Spoken Dialects: Strategies for Normalizing Swiss German

Authors

  • Pierre-Edouard Honnet
  • Andrei Popescu-Belis
  • Claudiu Musat
  • Michael Baeriswyl
Abstract

The goal of this work is to design a machine translation system for a low-resource family of dialects, collectively known as Swiss German. We list the parallel resources that we collected, and present three strategies for normalizing Swiss German input in order to address the regional and spelling diversity. We show that character-based neural MT is the best solution for text normalization and that, in combination with phrase-based statistical MT, we reach a BLEU score of 36%. This value, however, is shown to decrease as the testing dialect becomes more remote from the training dialect.
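
As an illustration of what "character-based" means for the normalization step, the sketch below splits a sentence into single-character tokens with an explicit word-boundary marker, which is the kind of input a character-level sequence-to-sequence model typically consumes. This is a minimal sketch and not the authors' preprocessing: the marker symbol, the function names, and the sample phrase are all assumptions made for illustration.

```python
from typing import List

# Hypothetical word-boundary marker (an assumption, not taken from the paper).
BOUNDARY = "▁"


def to_char_tokens(text: str) -> List[str]:
    """Split a sentence into character tokens, marking word boundaries."""
    words = text.split()  # collapses runs of whitespace
    return list(BOUNDARY.join(words))


def from_char_tokens(tokens: List[str]) -> str:
    """Invert to_char_tokens: join characters and restore spaces."""
    return "".join(tokens).replace(BOUNDARY, " ")


if __name__ == "__main__":
    # Illustrative input only; not drawn from the paper's data.
    tokens = to_char_tokens("grüezi mitenand")
    print(tokens)
    print(from_char_tokens(tokens))
```

Working at the character level lets a model generalize over spelling variants of the same word, since variants share most of their character sequence even when they never match as whole tokens.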

Similar resources

ArchiMob - A Corpus of Spoken Swiss German

Swiss dialects of German are, unlike most dialects of well standardised languages, widely used in everyday communication. Despite this fact, automatic processing of Swiss German is still a considerable challenge due to the fact that it is mostly a spoken variety rarely recorded and that it is subject to considerable regional variation. This paper presents a freely available general-purpose corp...

Automatic speech recognition and translation of a Swiss German dialect: Walliserdeutsch

Walliserdeutsch is a Swiss German dialect spoken in the south west of Switzerland. To investigate the potential of automatic speech processing of Walliserdeutsch, a small database was collected based mainly on broadcast news from a local radio station. Experiments suggest that automatic speech recognition is feasible: use of another (Swiss German) database shows that the small data size lends i...

A Resource for Natural Language Processing of Swiss German Dialects

Since there are only a few resources for Swiss German dialects, we compiled a corpus of 115,000 tokens, manually annotated with PoS tags. The goal is to provide a basic data set for developing NLP applications for Swiss German. We extended the original corpus and improved its annotation consistency. Furthermore, we trained dialect-specific PoS-tagging models and implemented a baseline system for...

Rhythmic variability in Swiss German dialects

Speech rhythm can be measured acoustically in terms of durational characteristics of consonantal and vocalic intervals. The present paper investigated how acoustically measurable rhythm varies across dialects of Swiss German. Rhythmic measurements (%V, ΔC, ΔV, varcoC, varcoV, rPVI-C, nPVI-C, nPVI-V) were carried out on four sentences of six speakers from eight Swiss dialects. Results indicate th...

Verb Clusters in Continental West Germanic Dialects

The Continental West Germanic Languages include the standard varieties of Dutch, Frisian, and High German, as well as a large number of non-standard varieties, the more familiar of which are the dialects spoken in Belgium and the South of the Netherlands (Flemish, Brabantish, Limburgian), Northern Germany (Low German), the Rhine Valley (Luxemburgish), South-Eastern Germany and Austria (e.g. Bav...

Journal title:
  • CoRR

Volume abs/1710.11035  Issue

Pages  -

Publication date 2017